Abstract

The observed structure of protein interaction networks is corrupted by many
false positive and false negative links. This observational incompleteness is
abstracted as random link removal and a specific, experimentally motivated
(spoke) link rearrangement. Their impact on the structural properties of
gene-duplication-and-mutation network models is studied. For the degree
distribution a curve collapse is found, showing no sensitive dependence on the
link removal/rearrangement strengths and disallowing a quantitative extraction
of model parameters. The spoke link rearrangement process moves other
structural observables, like degree correlations, cluster coefficient and
motif frequencies, closer to their counterparts extracted from the yeast data.
This underlines how important it is to model the observational incompleteness
precisely when network structure models are to be compared quantitatively to
data.

1 Introduction



Recent advances in the identification of protein interactions [1,2,3] have
greatly extended their number in actual datasets [4,5,6,7]. The accumulated
knowledge about these complex mutual interactions of single proteins is
represented in protein interaction networks, where proteins are nodes and
interactions are links between the respective nodes. Investigations of the
topological structure of these network graphs contribute significantly to the
understanding of the organizational principles and evolutionary strategies
behind such complex interaction networks.

Several models have been proposed for the structural evolution of protein
interaction networks [8,9,10,11,12]. All of them are based on the idea that
gene duplication and mutation are the mechanism responsible for the evolution
from a small number of proteins up to the several thousand known today. This
mechanism, where links of highly connected nodes are more likely to be
duplicated, is a biological representation of network growth with preferential
attachment [13]. The model described in [12] fits best to observed properties
of real yeast interaction networks, extracted for example from the DIP
database [4]. During one evolutionary step of this model (see Fig. 1) a
randomly selected node is copied with all its links. With probability δ each
of the copied links is then subject to removal, and with probability p a new
(homodimer) link is established between original and copied node. If after the
probabilistic link removals and addition the copied node is left without any
link, it is deleted. Fig. 2 illustrates the dependence of the degree
distribution on the model parameter δ. As for the giant component of real
yeast data [4], the number of nodes has been set to N_gc = 4687. The parameter
p = 0.1 has been estimated according to the number of homodimers in real yeast
datasets. It has no significant influence on the degree distribution. For
δ = 0.58 the degree distribution matches its data counterpart. Other network
properties, like degree correlation, cluster coefficient and selected motifs,
also agree with data to some extent (see Fig. 4).
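
The evolutionary step described above can be sketched in a few lines of
Python. This is only a minimal illustration and not the implementation of
Ref. [12]: the function name, the two-node seed network and the
adjacency-set representation are assumptions made here.

```python
import random


def duplication_mutation_network(n_target, delta, p, seed=None):
    """Grow a network by node duplication followed by link removal.

    Hypothetical sketch: delta is the removal probability of a copied link,
    p the probability of linking original and copy.
    """
    rng = random.Random(seed)
    # Assumption of this sketch: start from two linked nodes.
    adj = {0: {1}, 1: {0}}
    nodes = [0, 1]
    next_id = 2
    while len(adj) < n_target:
        original = rng.choice(nodes)
        # Copy all links of the original; each copy survives with prob. 1 - delta.
        kept = {nb for nb in adj[original] if rng.random() >= delta}
        # With probability p, add a link between original and copied node.
        if rng.random() < p:
            kept.add(original)
        # A copy left without any link is deleted, i.e. never added.
        if not kept:
            continue
        adj[next_id] = kept
        for nb in kept:
            adj[nb].add(next_id)
        nodes.append(next_id)
        next_id += 1
    return adj
```

Calling `duplication_mutation_network(4687, 0.58, 0.1)` would correspond to
the parameter values quoted in the text; the resulting average degree depends
on the realization.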

Although the simple gene-duplication-and-mutation mechanism disregards any
selection process and does not take further regulatory mechanisms into
account, it gives us an understanding in principle of how evolution went to
work. However, caution should be taken when it comes to a biological
interpretation of the fitted model parameter values. One has to take into
account that the actual data of protein interactions contain a large number of
false links. In Refs. [14,15,16] different methods are applied to estimate the
number of links which are set in the real yeast datasets but do not exist
(false positives), and those which exist but are not contained in the dataset
(false negatives). Mostly by comparing high-confidence with high-throughput
datasets, current estimates are that the total number of interactions is up to
30,000 compared to about 15,000 known today [14,16] and that within these
15,000 interactions 50% of the links are wrongly assigned [15,16].

Given this amount of observational incompleteness, two driving questions 
emerge: can we compare a model like [12] "as is" with data, and how rele- 
vant are fitted model parameters? The aim of this Paper is to analyze how 
much and in what directions various forms of observational incompleteness 
modify extracted structural properties of protein interaction networks, like 
degree distribution, degree correlations, cluster coefficient and motifs. 

In abstracted form, the observational incompleteness leading to the occurrence 
of false negative / positive links can be modeled as link removal / rearrange- 
ment applied to an initial "true" network. The simplest variant is completely 
random removal / rearrangement of links. Its effect on scale-free networks 
has already been discussed [17]. Another form of link removal is subnetwork 
sampling, like snowball sampling [18,19], truncated random walk sampling or 
traceroute exploration [20]. 

Throughout this Paper we will make use of the network model of Ref. [12].
It serves to synthetically generate "true" protein interaction networks.
Sect. 2 discusses completely random deletion of links. Sect. 3 introduces a
very specific random link rearrangement, which is directly motivated by the
experimentally applied complex purification methods [1,2]. A conclusion and an
outlook are given in Sect. 4.



2 Random link removal 

Random link removal represents the simplest way to model false negative
links. A network with N nodes and L links is considered to be the initial
"true" network. One after the other, links are selected randomly and removed
from the network. The removal strength ν = ΔL/L counts the relative number of
deleted, i.e. false negative, links. The impact of this random link removal on
the degree distribution p(k) of the gene-duplication-and-mutation network of
Ref. [12] is shown in Fig. 3a. With increasing removal strength the resulting
degree distribution deviates more and more from its initial counterpart.
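
A minimal Python sketch of this procedure is given below. It assumes an
adjacency-set representation with integer node labels; the function names
are hypothetical and chosen only for illustration. The second helper extracts
the giant component, from which the degree distributions are sampled.

```python
import random


def remove_random_links(adj, nu, seed=None):
    """Return a copy of adj with a fraction nu of its links removed at random."""
    rng = random.Random(seed)
    # List every undirected link once (requires comparable node labels).
    links = [(i, j) for i in adj for j in adj[i] if i < j]
    removed = rng.sample(links, round(nu * len(links)))
    thinned = {i: set(nbs) for i, nbs in adj.items()}
    for i, j in removed:
        thinned[i].discard(j)
        thinned[j].discard(i)
    return thinned, removed


def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

The list `removed` holds the links that would count as false negatives in the
thinned network.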

Admittedly, the degree distributions of Fig. 3a with δ = 0.58 and ν > 0
resemble those of Fig. 2 with δ > 0.58 and ν = 0. In fact, the shown degree
distributions with ν = 0.2, 0.4, 0.6, 0.8 match those resulting from δ = 0.62,
0.66, 0.73, 0.85, respectively. Judged by the degree distribution only, a
network with specific δ, but subject to random link removal, appears like a
network corresponding to a larger δ.

Note, however, that this comparison is not yet sound. With ν > 0 the size N_gc
of the giant network component, from which all degree distributions of Fig. 3a
have been sampled, is reduced. For δ = 0.58 and ν = 0.2, 0.4, 0.6, 0.8 it
results in N_gc ≈ 4400, 4000, 3350, 2100, respectively. By model construction
the initial size N_gc(ν=0) = N = 4687 is independent of the parameter δ.
Consequently, the number of nodes contained in the giant component does
not agree between the link-removed model network and the reparametrized
initial model network, although their degree distributions match.

For a proper comparison the model network reduced by random link removal
should end up with the same average degree ⟨k⟩ and the same size N_gc of the
giant component as the reparametrized initial network model. For reference, we
choose ⟨k⟩ = 6.47 and N_gc = 4687 as observed in the yeast data [4]. This
requires the model network to have initially more nodes and links before
random link removal sets in. Initial numbers of nodes and links are not
independent of each other and require careful tuning, so that after random
link removal the targeted ⟨k⟩ and N_gc are met precisely. For example, for
removal strengths ν = 0.2, 0.31, 0.395 the rescaled parameters are
(N, δ) = (4950, 0.55), (5100, 0.53), (5250, 0.51). The remaining parameter
p = 0.1 has been kept fixed. Note that even larger removal strengths are not
feasible for the chosen network model: they would require δ < 0.5, and in this
regime the model is no longer self-averaging [11].

Fig. 3b illustrates the degree distributions obtained after random link removal
has been applied to the parameter-rescaled model realizations. All
distributions corresponding to different removal strengths collapse to one
single curve. This curve collapse is somewhat surprising, because by
construction only the size of the resulting giant network component and the
resulting average degree have been set the same. Each curve results from the
interplay of two effects: initially, i.e. before random link removal sets in,
a smaller δ leads to a flatter degree distribution (see again Fig. 2), which is
then, once random link removal sets in, turned into a steeper distribution
(see again Fig. 3a).

If the resulting degree distributions had all been Poissonians, then the curve
collapse would have been straightforward to understand. The rate equation
for random link removal [17]

\[
  \frac{dp_k}{d\nu} = \frac{k+1}{1-\nu}\, p_{k+1} - \frac{k}{1-\nu}\, p_k
  \qquad (1)
\]

is solved by p_k = (λ^k / k!) e^(-λ) with λ = ⟨k⟩ = 2L(1 - ν)/N. A Poissonian
degree distribution remains Poissonian, although with rescaled parameter λ.
Hence, a Poissonian network constructed with ⟨k⟩ = λ can also be obtained by
first constructing a denser Poissonian network with ⟨k⟩ = λ/(1 - ν), which is
then subject to random link removal of strength ν.

Also for scale-free distributions p_k ~ k^(-γ) the curve collapse can be
constructed with a rescaling of model parameters. In case of a growth process
with preferential attachment Π ~ k + λ, the model parameters are the number m
of open links, with which a new node enters the network, and the
attractiveness λ [21]. They determine the scale-free exponent γ = 3 + λ/m.
Now, Ref. [17] has shown that during the preferential-detachment-like random
link removal, where the initial average degree ⟨k⟩ = 2m is reduced, the
scale-free exponent is conserved in the large-k regime. This implies that
after random link removal with strength ν the resulting network appears like
one which has been grown with rescaled model parameters
m_rescaled = (1 - ν) m and λ_rescaled = (m_rescaled/m) λ = (1 - ν) λ.
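
Both statements can be checked in a few lines. The sketch below assumes that
the rate equation for random link removal reads dp_k/dν = [(k+1) p_{k+1} -
k p_k]/(1 - ν), and it writes the Poissonian parameter as λ(ν) = λ_0 (1 - ν);
the shorthand λ_0 = 2L/N for the initial average degree is introduced here
only for this check.

```latex
% Poissonian check: p_k = \lambda^k e^{-\lambda}/k!, \lambda(\nu) = \lambda_0 (1-\nu),
% so that d\lambda/d\nu = -\lambda_0 = -\lambda/(1-\nu).
\begin{align*}
\frac{dp_k}{d\nu}
  &= \frac{d\lambda}{d\nu}\,\frac{\partial p_k}{\partial\lambda}
   = -\frac{\lambda}{1-\nu}\left(\frac{k}{\lambda}-1\right)p_k
   = \frac{\lambda-k}{1-\nu}\,p_k , \\
\frac{k+1}{1-\nu}\,p_{k+1}-\frac{k}{1-\nu}\,p_k
  &= \frac{\lambda-k}{1-\nu}\,p_k
  \qquad\text{since } (k+1)\,p_{k+1}=\lambda\,p_k .
\end{align*}
% Scale-free check: here \lambda denotes the attractiveness, not the
% Poissonian parameter.
\[
\gamma_{\mathrm{rescaled}}
  = 3+\frac{\lambda_{\mathrm{rescaled}}}{m_{\mathrm{rescaled}}}
  = 3+\frac{(1-\nu)\,\lambda}{(1-\nu)\,m}
  = 3+\frac{\lambda}{m}
  = \gamma .
\]
```

Both sides of the rate equation agree for the Poissonian, and the rescaled
parameter pair leaves the exponent unchanged, as stated in the text.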

Although the small excursions to Poissonian and scale-free networks have shed 
some light on the nature of the curve collapse, its appearance in connection 
with gene-duplication-and-mutation networks remains without a deeper expla- 
nation. Nevertheless, from a pragmatic point of view we can say the following: 
if we consider a gene-duplication-and-mutation network as the "true" network 
and introduce false negatives in the form of random link removal, then the 
resulting degree distribution appears like one obtained from the same gene- 
duplication-and-mutation process, but with different parameters. It would be 
inappropriate to give a biological interpretation to the magnitude of the ex- 
tracted parameters. 

The curve collapse motivates a look at observables beyond the degree
distribution. The average degree ⟨k_ngb|k⟩ of the neighbors of a node with
degree k represents a measure for degree correlations. Fig. 4a illustrates its
dependence on the removal strength. The same procedure with rescaled model
parameters as for Fig. 3b has been applied. With increasing ν the degree
correlations are reduced to some minor extent. They stay close to the
correlations of the ν = 0 model. The comparison with the correlations observed
in the yeast data makes clear that all curves corresponding to different ν
match with more or less the same quality.
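
For concreteness, the average neighbor degree as a function of k can be
computed as in the following Python sketch. The adjacency-set representation
and the function name are assumptions of this illustration, not taken from
the analysis code behind Fig. 4a.

```python
from collections import defaultdict


def average_neighbor_degree(adj):
    """Map each degree k to the mean neighbor degree of all nodes with degree k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for node, nbs in adj.items():
        k = len(nbs)
        if k == 0:
            continue  # isolated nodes have no neighbors to average over
        # Mean degree of this node's neighbors.
        sums[k] += sum(len(adj[nb]) for nb in nbs) / k
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(counts)}
```

A curve that decreases with k indicates disassortative mixing, while a flat
curve indicates uncorrelated degrees.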

A similar finding is obtained for the degree-dependent cluster coefficient
C(k). It represents the fraction of triangles formed by a node with degree k
and its neighbors out of the maximum possible number k(k - 1)/2. Fig. 4b shows
its dependence on the removal strength. C(k) decreases with increasing ν, but
remains within the same order of magnitude as for ν = 0. Compared to the yeast
data, all ν curves are too low and none of them is really to be favored over
the others.
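
The definition of C(k) translates directly into code. The following Python
sketch, with a hypothetical function name and an assumed adjacency-set
representation with comparable node labels, averages the local cluster
coefficient over all nodes of equal degree.

```python
from collections import defaultdict


def degree_dependent_clustering(adj):
    """Map each degree k >= 2 to the mean local cluster coefficient C(k)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for node, nbs in adj.items():
        k = len(nbs)
        if k < 2:
            continue  # no neighbor pair, hence no possible triangle
        # Count linked neighbor pairs, i.e. triangles through this node.
        triangles = sum(1 for a in nbs for b in nbs if a < b and b in adj[a])
        sums[k] += triangles / (k * (k - 1) / 2)
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(counts)}
```

Nodes with degree below two are skipped, since the maximum possible number
k(k - 1)/2 of triangles vanishes for them.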

A corresponding conclusion can also be drawn from an analysis based on
motif structures. A variety of motif systematics has been discussed in the
literature [22,23,24,25]. Our selected set is depicted in Fig. 5. It is
restricted to triangles, squares and pentagons with different intra-link
structure. The loops within these motifs represent potential regulatory
mechanisms. The total number M = Σ_motifs M_motif of all selected motifs as
well as their relative frequencies M_motif/M have been determined in
dependence on the random link removal strength. Fig. 4c reveals that the total
number decreases with ν. At ν ≈ 0.2 it matches its yeast data counterpart.
Within the model the three dominant contributions come from the motifs "sqr",
"pent" and "pent1". The relative frequency of "sqr" basically remains
independent of ν, but noticeably overestimates the frequency extracted from
the yeast data set. With increasing removal strength the relative frequency of
"pent" increases slightly, whereas that of "pent1" decreases to some small
extent. Both more or less agree with their yeast data counterparts. Except for
"pent2b", no agreement is reached for the relative frequencies of all "pent"
motifs with more than one intra link. The significant model underestimations
hold for all link removal strengths.



3 Spoke link rearrangement 

So far the modeling of observational incompleteness has only taken subnetwork 
sampling in the form of random link removal into account. In this way only 
false negative links have been created. For the modeling of false positive links 
some kind of link rearrangement or link addition is needed. We will now discuss 
a very specific random link rearrangement, which is directly motivated from 
the shortcomings in the generation and interpretation of protein-interaction 
data. 

Using the complex purification methods, namely affinity precipitation and
affinity chromatography [1,2], a protein is tagged and placed into the cell
lysate. The tagged protein (bait) is then isolated and analyzed with its
associated proteins (preys). It is not obvious how to assign links between the
bait and the preys found in the protein complex. In the commonly used spoke
algorithm [26] direct links are defined between the bait and all its preys.
This approach is illustrated in Fig. 6. It does not take into account the
possibility that the bait is not directly interacting with all preys but via
intermediate proteins. This results in false positive and negative links
(Fig. 6a). Moreover, possible interactions between the prey proteins
themselves are also not taken care of, resulting in even more false negative
links (Fig. 6b). Similar effects occur with the yeast-two-hybrid [16] and the
synthetic lethality methods [3]. Although the yeast-two-hybrid method
characterizes the interaction between two target proteins, no assurance can be
given that this interaction is not provided by an intermediate protein. With
the synthetic lethality method an interaction is assumed between two
functionally correlated proteins, but even if they are part of the same
complex, it is not clear whether a direct interaction exists.

To study the influence of this effect on the network topology we propose a
local random link rearrangement, which hereafter is called spoke link
rearrangement. After selection of an initial (bait) node, one of its first
(prey) neighbors is chosen at random. The latter then randomly chooses one of
its own first neighbors, excluding of course the initial node. Two cases then
have to be distinguished. If the last node is a second neighbor of the bait
node, a new, false-positive link between these two nodes is introduced and
the old link between the two prey nodes is removed, thereby becoming a false
negative; see again Fig. 6a. In the other case, the second prey node turns out
to be a first neighbor of the bait node, upon which only the link between the
two prey nodes is removed and becomes a false negative; see again Fig. 6b. Due
to the second case, the spoke link rearrangement is not a pure link
rearrangement. However, the cluster coefficient is small enough to keep the
link removal part small (see Fig. 4b and third row of Fig. 7).
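
One spoke link rearrangement step, as described above, may be sketched in
Python as follows. The function name, the return labels and the in-place
update of an adjacency-set dictionary are assumptions made for this
illustration; how the bait is selected is left to the caller.

```python
import random


def spoke_rearrangement_step(adj, bait, rng):
    """Apply one spoke step for the given bait; modifies adj in place.

    Returns "rearranged" (Fig. 6a case), "removed" (Fig. 6b case) or None
    if the bait or its chosen prey has no suitable neighbor.
    """
    if not adj[bait]:
        return None
    prey1 = rng.choice(sorted(adj[bait]))
    # The first prey picks one of its own neighbors, excluding the bait.
    candidates = sorted(adj[prey1] - {bait})
    if not candidates:
        return None
    prey2 = rng.choice(candidates)
    # In both cases the prey-prey link is removed (false negative).
    adj[prey1].discard(prey2)
    adj[prey2].discard(prey1)
    if prey2 in adj[bait]:
        return "removed"  # second prey was already a first neighbor of the bait
    # Otherwise add the false-positive link between bait and second prey.
    adj[bait].add(prey2)
    adj[prey2].add(bait)
    return "rearranged"
```

In both cases the bait and the two preys stay connected to each other through
the bait, so no node is detached from the network by a single step.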

So far the selection of bait nodes in the spoke link rearrangement has not
been specified. In the yeast data [4], the bait proteins are of course known
and make up approximately a quarter of all listed protein nodes. In general
they have a larger degree than the overall average. Their degree distribution
p_bait(k) is different from the observed overall degree distribution p(k), but
can be mapped onto the latter via

\[
  p_{\mathrm{bait}}(k) \sim k^{\alpha}\, p_k \qquad (2)
\]

with α ≈ 0.3. This indicates that a bait node i with degree k_i might be
picked from the model network with the preferential bias

\[
  \Pi_{\mathrm{bait}}(k_i) = \frac{k_i^{\alpha}}{\sum_j k_j^{\alpha}} .
  \qquad (3)
\]

Since the observed degree distribution p_k entering (2) is most likely not
equal to the unknown true one, we will discuss probabilistic bait selection
with α = 0 and 1 in the following.

The combination of the biased bait selection (3) and the spoke link
rearrangement process is applied to the network structure obtained with the
gene-duplication-and-mutation model of Ref. [12]. Again, model parameters
are taken to match the yeast data, i.e. N = 4687, δ = 0.58 and p = 0.1.
The rearrangement strength ν = ΔL/L counts the relative number of bait
selections implying link rearrangement or removal.

Since link removal is included in the spoke link rearrangement, the average
degree decreases with increasing ν from its initial value ⟨k⟩ = 6.47. For the
already very large rearrangement strength ν = 0.8 we arrive at the slightly
reduced values ⟨k⟩ = 6.40 and 5.97 for α = 0 and 1, respectively. Note also
that the giant component of the network does not change with ν and remains
at its initial value N_gc = N. Both subprocesses of the spoke link
rearrangement always keep the three involved nodes connected to the overall
network. This observation together with the decreasing average degree would
imply that for a fair comparison between the yeast data and the
spoke-link-rearranged model a reparametrization of the model parameter δ is
required to start with an initially denser network. Since the reduction of the
average degree remains rather small for modest, data-relevant rearrangement
strengths, we refrain from doing so.
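
The degree-biased bait selection with weights proportional to k^α can be
sketched with the Python standard library as follows. The function name and
the adjacency-set representation are assumptions of this illustration; α = 0
reduces to uniform selection and α = 1 to selection proportional to degree.

```python
import random


def pick_bait(adj, alpha, rng):
    """Pick a bait node with probability proportional to (degree ** alpha)."""
    nodes = sorted(adj)
    # Note: Python evaluates 0 ** 0 as 1, so alpha = 0 is uniform over all nodes.
    weights = [len(adj[node]) ** alpha for node in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]
```

Repeating the selection about ν L times, each followed by one spoke link
rearrangement step, would correspond to a rearrangement strength ν. If every
weight is zero (for example α > 0 on a network without links),
`random.choices` raises an error, which this sketch does not handle.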

The first row of Fig. 7 shows the dependence of the degree distribution on
the rearrangement strength ν. For α = 0 the probability to find low- and
high-degree nodes decreases with increasing ν. Already for small 0 < ν < 0.5
the deviations from the initial degree distribution are noticeable. For very
large ν the degree distribution appears to converge towards a Poissonian.
For α = 1 the outcome is different. For small ν the deviations from the
initial degree distribution are barely noticeable, leading to a curve collapse
in good approximation. Even for large ν = 5 the resulting p_k is still close
by.

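Analytical insight into these findings can be obtained from the following rate
equation:

\[
  \frac{2}{\langle k\rangle}\,\frac{dp_k(\nu)}{d\nu}
  = \frac{(k-1)^{\alpha}}{\langle k^{\alpha}\rangle}\, p_{k-1}(\nu)
  - \frac{k^{\alpha}}{\langle k^{\alpha}\rangle}\, p_k(\nu)
  + (1-\delta_{k,0})\,\frac{k+1}{\langle k\rangle}\, p_{k+1}(\nu)
  - (1-\delta_{k,1})\,\frac{k}{\langle k\rangle}\, p_k(\nu) .
  \qquad (4)
\]

Here δ_{k,0} and δ_{k,1} denote Kronecker deltas. The first two terms on the
right-hand side represent the gain and loss term of the selected bait, which
increases its degree by one. The third and fourth term describe the first
neighbor of the bait, which loses a link; consult again Fig. 6a. The spoke
link removal (see Fig. 6b) has been neglected in (4). Furthermore, degree
correlations have been discarded.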

For the case α = 0 the stationary solution of (4) is found to be the modified
Poissonian

\[
  p_k = \frac{\langle k\rangle^{k}}{k!\,\left(e^{\langle k\rangle}-1\right)}
  \qquad (k \ge 1) , \qquad (5)
\]

where p_0 = 0 is required from N_gc = N for all ν. This confirms the
simulational finding of Fig. 7a for very large ν. For α = 1, p_k ~ k^(-1)
represents the non-normalizable stationary solution of the rate equation.
However, due to the finiteness of the network a cutoff k_c may be introduced,
leading to

\[
  p_k =
  \begin{cases}
    a\, k^{-1}\, e^{-k/k_c} & (k \ge 1) \\
    0 & (k = 0)
  \end{cases}
  \qquad (6)
\]

With parameters a = 0.35 and k_c = 18 this solution is also illustrated in
Fig. 7b and its inset. It agrees nicely with p_k(ν = 5) obtained from the
simulations.

The degree correlation in the form ⟨k_ngb|k⟩ is illustrated in the second row
of Fig. 7. In case of α = 0 and especially for very small degrees, the average
neighbor degree rapidly decreases with increasing rearrangement strength. Its
initial disassortative character is turned into a randomized one, which is
reflected in the k-independence. The average neighbor degree associated with
α = 1 shows a similar, but weaker dependence on the rearrangement strength.

The third row of Fig. 7 focuses on the degree-dependent cluster coefficient.
Not much happens for α = 0 at modest rearrangement strengths ν < 0.7,
except for very low degrees, where the cluster coefficient decreases as
k → 0. This behavior is also observed in the yeast data. In case of the biased
bait picking with α = 1, the cluster coefficient increases for all k as the
rearrangement strength increases from ν = 0 to ν ≈ 0.3, only then to decrease
again for even larger ν. At the turning point ν ≈ 0.3, the found
degree-dependent cluster coefficient almost matches its counterpart from the
yeast data. The overall cluster coefficient declines with increasing
ν = 0.1, 0.3, 0.5, 0.7, 5 as ⟨C⟩ = 0.14, 0.11, 0.08, 0.06, 0.002 for α = 0 and
as ⟨C⟩ = 0.20, 0.19, 0.16, 0.12, 0.008 for α = 1. For very large rearrangement
strengths it becomes very small.

The motif statistics for the motifs of Fig. 5 are shown in the last row of
Fig. 7. For α = 0 their total number is a strictly decreasing function of ν.
The case α = 1 shows a different behavior. From ν = 0 to about ν ≈ 0.2 it is
first an increasing function and then becomes a decreasing function beyond
this point. Compared to the yeast data, the order of magnitude is right for
both α-values. The relative frequency of the motif "sqr" decreases with ν.
Characteristic trends are also observed for other motifs, but it is difficult
to provide a solid explanation for them. Just by looking at the two
subfigures, we have the impression that for the combination α = 1, ν ≈ 0.3 the
distribution of relative motif frequencies comes closest to the yeast
distribution.

We have arrived at a remarkable result: in comparison to the respective
yeast-data counterparts, the degree distribution resulting from the parameter
combination α = 1 and ν ≈ 0.3 of the spoke link rearrangement matches
perfectly, the found degree correlation is at least close, the
degree-dependent cluster coefficient matches almost perfectly and the
distribution of relative motif frequencies also comes very close. Compared to
the initial model at ν = 0, the agreement with data has improved.

Although not shown here, we have also looked at a combined application of
spoke link rearrangement and random link removal. Results turn out to be a
mere superposition of those obtained independently in this and the previous
section.



4 Conclusion 

Observed protein interaction networks are known to be corrupted by a large
number of false negative and false positive links. This observational
incompleteness impacts the analysis of network structure. We have assumed the
emergence of false negative links to be random and have modeled it with
a random subnetwork sampling like random link removal. Most of the false
positive links arise due to an operationally defined link assignment during
measurements. The latter has been abstracted as a specific random (spoke) link
rearrangement process. The modeling of both forms of observational
incompleteness reveals that the resulting degree distributions either do not
depend at all or depend only weakly on the applied link removal /
rearrangement strengths. Based on this curve collapse alone, no judgment can
be made on the qualities of the underlying gene-duplication-and-mutation
network models and no biological interpretation should be given to the
respective model parameters, like the mutation rate δ.

For observables beyond the degree distribution, like degree correlation,
cluster coefficient and motif frequencies, a dependence on the applied link
removal / rearrangement strength is found. Whereas for random link removal
this dependence remains small, spoke link rearrangement appears to move these
observables closer to their counterparts extracted from the DIP database.
This shows how important it is to include observational incompleteness in the
comparison between network models and data. It should also be included in
any systematic identification of statistically significant network measures
[25] and will gain more interest, the more rigorous the analysis of relevant
data becomes.

Fig. 1. The gene-duplication-and-mutation model of Ref. [12]: a) random
selection of a node, b) copy of this node with all of its links, c) deletion
of copied links with probability δ, d) introduction of a new link between
original and copied node with probability p.




Fig. 2. Degree distribution resulting from the network model proposed in Ref.
[12] for various parameter values δ. The other parameters have been set to
N = 4687 and p = 0.1. The value δ = 0.58 fits best to the yeast protein
interaction data taken from the DIP database [4].

Fig. 3. Degree distributions for various random link removal strengths ν.
(a, top) Parameters of the gene-duplication-and-mutation model [12] have been
set to N = 4687, δ = 0.58, p = 0.1, such that ⟨k⟩ ≈ 6.47 for ν = 0.
(b, bottom) Model parameters have been chosen such that the size of the giant
component and the average degree become N_gc ≈ 4687 and ⟨k⟩ ≈ 6.47 after link
removal; for ν = 0.2, 0.31, 0.395 rescaled parameter values are
(N, δ) = (4950, 0.55), (5100, 0.53), (5250, 0.51), and p = 0.1. The various
degree distributions have been sampled from the respective giant components of
50 independent network realizations.

Fig. 4. (a) Degree correlation, (b) degree-dependent cluster coefficient and
(c) relative frequencies of selected motifs (see Fig. 5) for various random
link removal strengths ν. Model parameters are the same as in Fig. 3b. The
various distributions have been sampled from the respective giant components
of 50 independent network realizations. For comparison, the respective
distributions obtained from the yeast protein interaction database [4] are
also shown.

Fig. 5. Motifs used for the analysis.

Fig. 6. Complex purification methods may lead to wrong link assignments.
(a) Bait protein A and prey proteins B, C, D bind to form a complex. Assigned
links reflect the bait-prey relationship. However, A does not directly bind to
C (red false positive). It is B which binds to C (blue false negative).
(b) For the complex ABC links are only assigned between bait A and preys B, C.
Link B-C is missed, resulting in a (blue) false negative.

Fig. 7. (Row 1) Degree distribution, (row 2) degree correlation, (row 3)
degree-dependent cluster coefficient and (row 4) relative motif frequencies
for (left / right column) α = 0 / 1 and various spoke link rearrangement
strengths ν. The parameters of the initial gene-duplication-and-mutation model
[12] have been set to N = 4687, δ = 0.58 and p = 0.1. The various
distributions have been sampled from 50 independent network realizations. For
comparison, the respective distributions extracted from the yeast protein
interaction database [4] are also shown. The analytical degree distributions
(5) and (6), which have been obtained in the large-ν limit, are also
illustrated in the left and right part of the first row, respectively.